Statistical Analysis of Vietnamese Dialect Corpus and Dialect Identification Experiments
Abstract
Speech recognition systems perform better when the corpus is organized for a specialized domain and applied consistently to recognition in specific situations. Vietnamese has a wide variety of dialects. Building a corpus of Vietnamese dialects is the first step toward a dialect identification system, which in turn can improve the performance of Vietnamese speech recognition in general. This paper presents a method for building a corpus for Vietnamese dialect identification. The Vietnamese corpus VDSPEC was built with topic-based recording and tonal balance; its total duration is 45.12 hours. The basic characteristics and preliminary evaluations of the corpus are also described. Statistical analysis of F0 variation and dialect classification experiments using LDA projection showed that the pronunciation modality of Vietnamese differs among three dialects: Hanoi, Hue, and Ho Chi Minh City. In the dialect identification experiments, the first four formants, their bandwidths, and F0 variants were used as input parameters for a GMM. The experimental results on the Vietnamese dialect corpus show a recognition rate of 66.3% without F0 information, rising to 72.2% with F0 information.
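The GMM-based identification step mentioned in the abstract can be illustrated with a minimal sketch: one Gaussian mixture is trained per dialect on frame-level feature vectors (four formants, four bandwidths, F0), and a test utterance is assigned to the dialect whose model gives the highest average log-likelihood. Everything specific here is an assumption, not taken from the paper: the two-component diagonal-covariance mixtures, the `DiagGMM`, `make_frames`, and `identify` names, and the synthetic feature values, which are arbitrary toy numbers and not measurements from VDSPEC.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


class DiagGMM:
    """Small diagonal-covariance GMM trained with EM (illustrative only)."""

    def __init__(self, n_components=2, n_iter=25, var_floor=1e-3, seed=0):
        self.k, self.n_iter = n_components, n_iter
        self.var_floor, self.seed = var_floor, seed

    def _frame_logps(self, x):
        # Per-component log(weight) + log density for one frame.
        return [math.log(w) + log_gauss_diag(x, m, v)
                for w, m, v in zip(self.weights, self.means, self.vars)]

    def fit(self, frames):
        n, dim = len(frames), len(frames[0])
        rng = random.Random(self.seed)
        # Initialise means from random frames, variances from the global variance.
        self.means = [list(f) for f in rng.sample(frames, self.k)]
        gmean = [sum(f[d] for f in frames) / n for d in range(dim)]
        gvar = [max(sum((f[d] - gmean[d]) ** 2 for f in frames) / n,
                    self.var_floor) for d in range(dim)]
        self.vars = [list(gvar) for _ in range(self.k)]
        self.weights = [1.0 / self.k] * self.k
        for _ in range(self.n_iter):
            # E-step: responsibilities of each component for each frame.
            resp = []
            for x in frames:
                logps = self._frame_logps(x)
                norm = logsumexp(logps)
                resp.append([math.exp(lp - norm) for lp in logps])
            # M-step: re-estimate weights, means, and variances.
            for j in range(self.k):
                nj = sum(r[j] for r in resp) + 1e-10
                self.weights[j] = nj / n
                self.means[j] = [sum(r[j] * x[d] for r, x in zip(resp, frames)) / nj
                                 for d in range(dim)]
                self.vars[j] = [max(sum(r[j] * (x[d] - self.means[j][d]) ** 2
                                        for r, x in zip(resp, frames)) / nj,
                                    self.var_floor) for d in range(dim)]
        return self

    def score(self, frames):
        """Average per-frame log-likelihood of an utterance."""
        return sum(logsumexp(self._frame_logps(x)) for x in frames) / len(frames)


# Toy feature centres (F1-F4, B1-B4, F0 in Hz): invented for illustration only.
BASE = [600, 1500, 2500, 3500, 80, 100, 120, 150, 200]
CENTERS = {
    "Hanoi": BASE,
    "Hue": [1.1 * v for v in BASE],
    "Ho Chi Minh City": [0.9 * v for v in BASE],
}


def make_frames(center, n, rng):
    """Draw n synthetic feature frames around a centre (3% relative spread)."""
    return [[rng.gauss(c, 0.03 * c) for c in center] for _ in range(n)]


def identify(frames, models):
    """Return the dialect whose GMM scores the utterance highest."""
    return max(models, key=lambda name: models[name].score(frames))


train_rng = random.Random(1)
models = {name: DiagGMM().fit(make_frames(center, 200, train_rng))
          for name, center in CENTERS.items()}

test_utterance = make_frames(CENTERS["Hue"], 40, random.Random(7))
print(identify(test_utterance, models))
```

Dropping the last element of each feature vector before training and scoring would give the "without F0" configuration, which is the kind of comparison the reported 66.3% and 72.2% rates refer to; the toy data here says nothing about those figures.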
Similar resources
The effect of first language (L1) dialects on the identification of Vietnamese word-final stops
This study examined the extent to which speakers’ first language (L1) dialect affects the identification of word-final stops in Vietnamese. Stops in the word-final position are unreleased in Vietnamese. Further, there is a /t/-/k/ merger in the Southern, but not the Northern dialect. We tested the hypothesis that the stop tokens produced in the Southern dialect are identified less accurately th...
Arabic Dialect Identification Using a Parallel Multidialectal Corpus
We present a study on sentence-level Arabic Dialect Identification using the newly developed Multidialectal Parallel Corpus of Arabic (MPCA) – the first experiments on such data. Using a set of surface features based on characters and words, we conduct three experiments with a linear Support Vector Machine classifier and a meta-classifier using stacked generalization – a method not previously a...
Globalization, Standardization, and Dialect Leveling in Iran
This paper is an attempt to shed light on the effects of modernization, urbanization, monolingual educational system, and mass media as well as the process of globalization on dialect leveling among Persian dialects. In so doing, the first part of the paper elaborates on the relationship between globalization and sociolinguistics, and on the concept of standardization. Also, it discusses some ...
Dialect experience in Vietnamese tone perception.
This study investigated the perceptual dimensions of tone in Vietnamese and the effect of dialect experience on listener's prelinguistic perception of tone. While Northern Vietnamese tones are cued by a combination of pitch and voice quality, Southern Vietnamese tones are purely pitch based. 30 listeners from two Vietnamese dialects (10 Northern, 20 Southern) participated in a speeded AX discri...
Advances in Word based Dialect/
In an earlier study, we proposed a very effective dialect/accent classification algorithm, which is named Word based Dialect Classification (WDC). The WDC works well for large size corpora and significantly outperforms traditional Large Vocabulary Continuous Speech Recognition (LVCSR) based systems, which is claimed to be the best performing system for language identification. For a small train...
Publication date: 2016